A powerful machine learning approach to identify interactions of differentially abundant gut microbial subsets in patients with metastatic and non-metastatic pancreatic cancer

ABSTRACT Pancreatic cancer has a dismal prognosis, as it is often diagnosed at stage IV of the disease and is characterized by metastatic spread. Gut microbiota and its metabolites have been suggested to influence the metastatic spread by modulating the host immune system or by promoting angiogenesis. To date, the gut microbial profiles of metastatic and non-metastatic patients need to be explored. Taking advantage of the 16S metagenomic sequencing and the PEnalized LOgistic Regression Analysis (PELORA) we identified clusters of bacteria with differential abundances between metastatic and non-metastatic patients. An overall increase in Gram-negative bacteria in metastatic patients compared to non-metastatic ones was identified using this method. Furthermore, to gain more insight into how gut microbes can predict metastases, a machine learning approach (iterative Random Forest) was performed. Iterative Random Forest analysis revealed which microorganisms were characterized by a different level of relative abundance between metastatic and non-metastatic patients and established a functional relationship between the relative abundance and the probability of having metastases. At the species level, the following bacteria were found to have the highest discriminatory power: Anaerostipes hadrus, Coprobacter secundus, Clostridium sp. 619, Roseburia inulinivorans, Porphyromonas and Odoribacter at the genus level, and Rhodospirillaceae, Clostridiaceae and Peptococcaceae at the family level. Finally, these data were intertwined with those from a metabolomics analysis on fecal samples of patients with or without metastasis to better understand the role of gut microbiota in the metastatic process. Artificial intelligence has been applied in different areas of the medical field. Translating its application in the field of gut microbiota analysis may help fully exploit the potential information contained in such a large amount of data aiming to open up new supportive areas of intervention in the management of cancer.


Introduction
Global statistics updated to 2020 assign pancreatic cancer (PC) in seventh place among the causes of cancer-related death. 1 Indeed, compared to all other cancers, the 5-year survival rate for PC is among the lowest (10%), 2 and it further drops to approximately 3% when considering stage IV disease characterized by metastatic spread. 3Such poor data are mainly linked to the subtle and insidious onset of PC and its late diagnosis. 4In fact, screening methods are not recommended unless in highrisk subjects, in the absence of specific biochemical markers and in the presence of non-typical symptoms in the initial stages of the disease, making early diagnosis difficult to achieve. 5,6][9] Metastases development is a complex multi-step process that requires cancer cells to detach from the primary tumor, invade the adjacent tissue, migrate into the blood vessels, survive in the bloodstream, leave the circulatory system and colonize a distant body site. 10,11Therefore, for productive metastasis establishment, cancer cells must cope with a number of hurdles, and the participation of several actors from the tumor microenvironment (TME) as well as from outside the tumor is required. 12A study by Fu et al. recently demonstrated that, as a part of the TME, intra-tumor microbiota influence the process of breast cancer metastasis by helping circulating tumor cells reorganize their cytoskeleton to resist mechanical stress in circulation. 13However, it has been proposed, that the gut microbiota and its metabolites may be involved at a distance in the process of tumor spread through different mechanisms.6][17] Another important contribution of gut microbiota to metastasis is the promotion of angiogenesis through stimulation of vascular endothelial growth factor (VEGF) expression by lipopolysaccharide (LPS) and Toll-like receptor-4 (TLR-4). 18,19espite this evidence linking the gut microbiota to the metastatic process, gut bacterial communities characterizing metastatic cancer carriers remain largely uninvestigated.
In the current study, the gut microbial composition of a cohort of patients with either local or metastatic PC was profiled with the aim of identifying bacterial patterns that could discriminate between these two conditions.Moreover, to gain further insight into how the gut microbiota can predict the risk of metastases, a machine learning approach was also performed to determine the interactions of the relative abundance of microorganisms, which may represent bacterial signatures of localized and metastatic PC.

Study participants
Subjects were recruited between June 2019 and April 2022 from the Fondazione IRCCS "Casa Sollievo della Sofferenza" Hospital, San Giovanni Rotondo, Italy, using the same protocols for each patient for collection, processing, and conservation of biological samples.Ethical approval from the IRCCS "Casa Sollievo della Sofferenza" Hospital, under Ethical Committee approval number N.184/ CE and written informed consent (IC) were obtained from the study participants.The study was carried out in accordance with the principles of good clinical practice, the Declaration of Helsinki, and in compliance with national legislation.Epidemiological and lifestyle data as well as disease stage, follow-up data, diet, drinking, and smoking habits, were collected at the time of recruitment when the stool specimen was also collected.All data were retrieved from hospital records and collected through direct interviews with the patients.Only patients with PC identified at the time of diagnosis prior to any cancer treatment were included in the study and were classified into two groups according to the presence of metastasis (i.e., non-metastatic and metastatic PC groups).Both patient groups were examined clinically, including assessment of basic anthropometric data and evaluation of blood and serum tests.In addition, patient follow-up data (such as treatment regimen and vital status) were retrieved after the recruitment date for descriptive purposes only and were not utilized in the statistical analyses planned to pursue the purpose of this study, which aimed to assess the presence of differentially abundant bacterial patterns between metastatic and non-metastatic patients, collected at recruitment.

Sample collection and DNA extraction
Each study participant provided a fresh stool sample in a sterile tube which was then stored at −80°C until use.DNA was isolated from a human fecal sample using the QIAamp Fast DNA Stool Mini Kit (Qiagen, Milan, Italy, Cat.N° 51604) according to the manufacturer's instructions.To optimize the ratio of non-human to human DNA, cells that were difficult to dissolve (such as Gram-positive bacteria) were lysed by heating the fecal suspension at 90°C for 5 min.DNA was checked for concentration and purity at the end of the isolation protocol and stored at −30°C until use.

Next-generation sequencing and analysis of bacterial 16S rRNA gene
For each sample, 12.5 ng of DNA obtained through the above procedure was used to amplify the V3-V4 region of the 16S rRNA gene using KAPA HiFi HotStart Ready Mix (Roche Diagnostics, Milan, Italy, Cat.N° 07958935001) and the following primers selected by Klindworth 20 with Illumina adapters added: forward primer: 5'-TCGTCGGCAGCGTCAGATGTGTATAAGAGAC-AGCCTACGGGNGGCWGCAG reverse primer: 5'-GTCTCGTGGGCTCGGAGATGTGTATAAGAGA-CAGGACTACHVGGGTATCTAATCC.Samples were then barcoded with the Nextera XT Index Kit (Illumina, Milan, Italy, Cat.N° FC-131-1002), as previously described. 21Next, the libraries were purified, quantified using a Qubit™ Flex Fluorometer (Thermo Scientific, Milan, Italy), pooled and sequenced in pairs (2 × 300 cycles) on an Illumina MiSeq platform (San Diego, CA, USA).FASTQ files generated by MiSeq were deposited in ArrayExpress under the code E-MTAB-12513 and de-multiplexed and analyzed using the 16S Metagenomics GAIA 2.0 software, Sequentia Biotech, Barcelona, Spain (2019).Read pairs were quality-controlled (i.e., trimming, clipping and adapter removal) based on FastQC and BBDuk and, to obtain the taxonomic profile of each sample, they were mapped with BWA-MEM against the 16S reference database available in NCBI GenBank.Specifically, all full-length sequences of the 16S rRNA gene from all prokaryotes are included except those longer than 3000 bp which are filtered out, since the gene is expected to be shorter.A prediction of the microbial functions was carried out using the Tax4Fun2 package in R. The Mann-Whitney test was used to analyze the differences of the Kyoto Encyclopedia of Genes and Genomes (KEGG) metabolic pathways between the two groups, which were then depicted using STAMP software.

Metabolomics analysis
The metabolomics and lipidomics analyses of feces from a subgroup of 20 metastatic and non-metastatic patients were performed using ultra high pressure liquid chromatography (UPLC 1290 system, Agilent Technologies) directly connected to mass spectrometry (TripleTOF 5600+ mass spectrometer, SCIEX equipped with an electrospray ionization source (ESI)).Detailed protocol is reported in Supplementary Material 3.

Statistical methods
Patient characteristics were reported as medians along with interquartile ranges (i.e., first-third quartiles) and observed and relative frequencies for continuous and categorical variables, respectively.For each continuous variable, the assumption of normality distribution was assessed using the Shapiro-Wilks test, and if the condition was met, mean ± standard deviation (SD) was reported instead of median.Comparisons between patients with and without metastases at enrollment were performed using a two-sample t-test (or Mann-Whitney U test as appropriate) and chi-square test with Yates' continuity correction (or Fisher exact test as appropriate) for continuous and categorical variables, respectively.Patients follow-up was defined as the time elapsed between the date of enrollment and death or the date of the last visit, whichever occurred first.The annual mortality rate was reported as the number of death events per 100 person-years and a Poisson regression model was used to assess the differences between the two groups.Stacked bar charts were used to show the gut microbiota composition (i.e., mean relative abundance %) at the phylum, family, genus, and species levels between patient groups.To identify clusters of bacterial populations such that the linear combination of their abundances is differential between patients with and without metastases, the PEnalized LOgistic Regression Analysis (PELORA) 22 was performed on logit-transformed and standardized relative abundance (i.e.4][25] When the relative abundance was exactly 0%, to compute a Z-score this quantity was replaced by 0.0001% (i.e.half of the lowest relative abundance found in the entire dataset).The mean Z-scores of all the bacteria included in a cluster were defined as "the centroid of the cluster".Multiple clusters of bacterial populations at different taxonomic levels were identified using this algorithm.However, to ease the interpretation of the results, a maximum of two informative clusters were considered for each scenario.Each identified cluster included bacterial populations whose relative abundances were consistently higher (or lower) in metastatic patients than in non-metastatic patients.The optimal penalty parameter value was chosen as the one that achieved the lowest median misclassification rate, computed after several data resamplings (bootstrapping).Box plots and scatterplots of the Z-scores computed at cluster centroids and heatmaps of the relative bacteria abundance (%) identified by PELORA within each cluster are shown at different taxonomic hierarchies.As stated by Dettling and Buhlmann, 22 the PELORA algorithm "aims to identify multiple class-separating groups (i.e., clusters) such that each group exhibits a good trade-off between expected differential expression and conditional variance of the group mean, and such that the groups together contribute most in predicting the response".In short, to form a new group of bacteria, their strategy proceeds in a 'cautious' manner, starting from scratch and relying on the incremental growth of the group by adding one bacterium after another until the L2 penalized negative log-likelihood function value, defined as the sum of the negative binomial log-likelihood (i.e. likelihood of the data) and the magnitude of the coefficients (L2 "ridge" penalty), in the group stabilizes and cannot be further minimized at the chosen penalty parameter.Once a group was found to be terminated, a new group was started, and the composition of the former groups remained unchanged, while they could still have an effect on the construction of the new group.Please refer to section 3.4.2 of the cited article 22 for the detailed steps concerning the implementation of the PELORA algorithm.
Furthermore, to delve deeper into the data, the iterative Random Forest (iRF) algorithm 26 was performed on the same structured dataset as the PELORA analysis (i.e. one record per patient and a set of variables in which the presence of metastasis, clinical information and the relative abundance of each bacterium with respect to each taxonomic classification, respectively, was recorded).It should be noted that the analysis performed with iRF was independent of that performed with the PELORA algorithm.The iRF is a generalization of the Random Forest (RF) algorithm and is commonly used to train a set of feature-weighted decision trees to detect stable and high-order interactions. 26In the iRF algorithm, a RF is run iteratively from one to 100 times and, at the last iteration, the feature-weighted RF is 'mapped' extracting and applying its decision rules to convert continuous or categorical features into binary variables, a crucial step as it allows the identification of all prevalent interactions.The proportion of times (out of several bootstrap samples) an interaction appears defines a "stability score" (i.e., 0 = totally unstable interaction, 1 = totally stable interaction).To enable iRF training, a "tuning phase" was performed to set its internal parameters.A similar application of the method has been extensively reported in, 27 and additional information about iRF algorithms along with its "tuning phase" can be found in the Supplemental Statistical Methods.Accumulated Local Effects (ALE) were performed to depict the functional relationship between the relative abundances of the most important bacteria and the probability of having metastases whereas Partial Dependence Plots (PDP) were performed to investigate the "most stable" interactions between two important bacteria, locating all combinations of relative abundance at high risk of metastasis.The discriminatory ability of the iRFs was assessed by the Area Under the ROC Curve (AUC) on the out-of-bag data, along with its 95% CI computed after 1000 stratified bootstrap replicates.Statistical significance was set at p < 0.05.All statistical analyses and plots were performed using R (R Development Core Team 2008, version 4.2, packages: supclust, iRF, iml, ggplot2, ggpubr, ComplexHeatmap, pROC, igraph, ggraph).Concerning the association analysis, pairwise Spearman correlations between the gut bacteria discriminating the two cohorts of patients according to AI and the metabolomics compounds were performed, and the Spearman's rank correlation coefficients (r) were calculated.Results were considered significant when p < 0.05.

Sample characteristics
Fifty-three patients with pancreatic tumors were enrolled in this study, including 25 with non-metastatic tumors (mean age at enrollment: 65.7 years) and 28 with metastatic tumors (mean age at enrollment: 68.7 years).For brevity, the acronyms "PC" and "PC met" were used to refer to these two groups.Anthropometric, demographic and clinical data are presented in Table 1.The two groups were balanced with the respect to all examined characteristics except for the variables such as: a) characteristics of the tumor, including tumor stage (p < .001)which was more advanced in metastatic patients (85.7% of PC met versus 0% of the PC was at stage IV), jaundice (p = .004)and endoprosthesis positioning (p = .010)which were both more abundant in PC versus PC met patients; b) treatment regimen (after stool collection) like therapeutic plan (p = .002),neo-adjuvant chemotherapy (p < .001)and radiotherapy (p = .001)which were mainly administered to PC group whereas PC met mainly underwent, first-line chemotherapy (p = .016)and radiotherapy (p = .001);c) biochemical tests at the enrollment visit, notably hemoglobin (p = .040),red blood cells (p = 0.022) and albumin (p = 0.041) which were lower in PC than in PC met patients, bilirubin (p = .014),aspartate aminotransferase (p = .033),GGT (p = .046)and ALP (p = .038)which were higher in PC than in PC met patients and d) mortality rate, namely, followup time from enrollment which was shorter in PC met (p = .003)and annual mortality rate which was higher in PC met (p = .002).

Comparison of fecal microbiota composition between PC patients with or without metastases
To highlight any differences in the composition of the intestinal microbiota in the PC and PC met groups, 16S rRNA gene sequencing was performed, producing an average of 183,749.283(±129203.7919)read pairs for each of the 53 study participants.Alpha-diversity indices (Shannon and Chao1) were calculated both at the genus and the species level in order to analyze the within-sample diversity of the bacterial profiles, but no significant difference emerged between the two groups (data not shown).Figure 1 shows the gut microbiota composition at the phylum (A), family (B), genus (C) and species (D) levels, expressed as relative abundance (%) in the PC and PC met groups.Taxonomic analysis data were used to carry out the PELORA algorithm to detect differences in the microbial population in the two groups of patients and identify specific bacterial patterns.All identified clusters are reported in Table 2.The phyla Tenericutes, Bacteroidetes and Nitrospinae were found to be part of a bacterial cluster and were all enriched in the PC met group compared to the PC group.Another cluster was identified at the family level, including the families of Anaeroplasmataceae, Sutterellaceae, Methanomassi llicoccaceae, Pasteurellaceae, Porphyromonadaceae, Lactobacillaceae, Oscillospiraceae, Bacteroidaceae, Enterococcaceae, Fusobacteriaceae, Morganellaceae and Xanthomonadaceae, which were all increased in PC met compared to PC (mean Z-scores of cluster centroid: −0.149 vs 0.133, p < .001).When comparing the two groups at the genus level, the linear combination of abundances revealed another cluster including Provencibacterium, Porphyromonas, Raoultella, Pseudoramibacter, Kluyvera, Slackia and Leptotrichia, all of which decreased in PC met with respect to PC (mean Z-scores of cluster centroids: 0.241 vs. −0.216,p < .001).At the species level, the PELORA algorithm identifies two clusters.The first one included Coprobacter secundus, Turicibacter sanguinis, Phascolarctobacterium succinatutens, Bacteroides thetaiotaomicron, Pseudomonas aeruginosa, Intestinimonas timonensis, Lactobacillus  .092(

Continued)
GUT MICROBES    The "Others" category includes all bacteria whose mean relative abundance is less than 1% (at phylum and family level), 0.5% (at genus level) and 0.1% (at species level), respectively.As expected, the percentage of bacteria included in the "Others" category tends to increase as more specific taxa levels are considered.multisaccharivorax, Desulfovibrio legallii, Eikenella corrodens, Clostridium sp.14505, and unknown bacteria, which were decreased in PC met compared to PC patients, with the exception of C. disporicum and P. oris (mean Z-scores of cluster centroids: 0.093 vs −0.083, p = 0.003).At this stage of the analysis, it is important to emphasize that two sample t-tests were performed for the mere purpose of providing a general descriptive comparison of cluster centroids and not for drawing inferential conclusions.For each taxonomic level, Figure 2 shows the distribution of Z-scores computed at the cluster centroids in the comparison between the PC group and the PC met group.Interestingly, at the species level, patients without metastases exhibited lower Z-scores at centroid one and higher Z-scores at centroid two (Figure 2d).Moreover, a scatter plot of the Z-scores computed within each cluster showed that the two species clusters clearly discriminated PC from PC met patients (Figure 2d).The heatmaps in Figure 3 show the relative abundance of the bacterial taxa composing each cluster at the phylum (A), family (B), genus (C) and species (D) levels for each study participant.

Machine learning approach to identify microbial patterns predictive of metastases
Results from iRFs at the family, genus, and species level are shown in Figures 4-6, respectively.The optimal number of iterations as well as the regularization factor for each iRF were detected in the "tuning phase" as shown in Supplemental statistical methods.
In the Variable Importance (VIMP) plot (Figure 4a), the families Rhodospirillaceae, Clostridiaceae, Peptococcaceae, Eggerthellaceae, Hafniaceae were found to provide the highest contribution (i.e., Relative VIMP > 10%) in detecting the presence of metastases, while Enterobacteriaceae and Oligosphaeraceae provided the least contribution to this discrimination.Sensitivity and specificity achieved at each possible cutoff of the predicted probabilities (of having metastases) computed by the iRF from the out-of-bag data were shown in the ROC curve (Figure 4b), in which AUC value of 0.727 (bootstrap 95% CI: 0.583-0.866)means that taking a metastatic PC patient and a non-metastatic PC patient at random, the probability that the algorithm assigns a higher probability of metastases in the metastatic patient (with respect to the non-metastatic patient) As a first step, the relative abundance (%) of each bacterium was logit transformed (so that values can theoretically range from negative to positive infinity) and, as a second step, the Z-score was computed by standardizing the transformed variable (i.e., taking the variable values, subtracting its mean and dividing by its SD).The centroid is calculated as the average Z-score of all the bacteria within each cluster.# All p-values were derived from the parametric two-sample t-test on Z-scores, with the exception of those marked as " § ", which were derived from the Mann-Whitney U test.The latter was performed in presence of no variance in one of the two groups (i.e., when the group has all values equals to 0% -denoted as "Absent". is 72.7%, although in terms of relative abundance none of the families showed statistically significant variations in the two groups (Figure 4c).ALE was estimated from iRF on the most important variables with the aim of showing the relationship between the relative abundance levels of specific bacteria and the mean predicted probability of having metastases with respect to the overall mean predicted probability defined at ALE = 0 (i.e., average risk).Specifically, the Rhodospirillaceae family showed a sigmoid curve in which lower levels (0-0.001%) were associated with a higher probability of having metastases (ALE = 0.30, i.e. 30% higher than the average risk), whereas higher levels (0.01-0.25%) were associated with a lower probability of metastases (ALE = −0.10,i.e. 10% lower than the average risk).A non-linear and non-monotonic curve was assumed for the Clostridiaceae family, in which the probability of metastases decreased within the 1.0-2.5% range, whereas it was higher above and below this interval.A near linear curve was also observed for Peptococcaceae, for which higher abundance correlated with a higher probability of metastases, and for Eggerthellaceae and Hafniaceae, for which increased levels predicted a lower probability of metastases, as compared to the average risk (Figure 4d).The network plot of interactions with the highest stability among the families is graphically represented in Figure 4e (stability score >0.50), with the main interactions involving Rhodospirillaceae, Peptococcaceae, Eggerthellaceae and Clostridiaceae.
The aforementioned dual interactions were integrated into the PDPs (Figure 4f), highlighting the regions in which patients with metastases were more likely to be found.Remarkably, patients with Rhodospirillaceae abundance within the range of 0.00031-0.003%together with Peptococcaceae within 0.002-0.008%,Clostridiaceae within 0.19-2.60%,Eggerthellaceae within 0.001-0.042%,Hafniaceaea <0.01%, Enterobacteriaceae <30.4% showed a 70-80% probability of having metastases at the time of enrollment.At the genus level, Porphyromonas and Odoribacter provided the highest contribution to detecting metastases, followed by Mailhella, Papillibacter, Slackia, Fusobacterium and Raoultella whereas Anaerostipes and Butyricicoccus contributed the least (Figure 5a).In Figure 5b, the ROC curve is reported, which denoted a fair quality level of the iRF discrimination ability: AUC = 0.727 (bootstrap 95% CI: 0.590-0.849).The relative abundance of Porphyromonas and Mailhella genera was significantly decreased for the former and enriched for the latter in the PC met group compared the PC group (p = .036and p = .044,respectively) (Figure 5c).Regarding ALE plots, one of the lowest Porphyromonas levels (close to 0.005%) was associated with a higher probability of metastases (ALE = 0.30, i.e. 30% higher than the average risk probability).Non-monotonic curves were found for the Odoribacter and Papillibacter.For Mailhella and Fusobacterium, an increase in the probability of metastasis was observed as the relative abundance plateaued.In contrast, for Slackia and Raoultella genera, after the initial plateau phase, the probability of metastases started to decrease slightly (Figure 5d).The most stable interaction pathways among the genera are shown in Figure 5e.At this taxonomic level, PDPs (Figure 5f) suggested that patients with Porphyromonas abundance lower than 0.01% together with Odoribacter <1.2%, Raoultella <0.01%, Slackia <0.02%, Papillibacter <0.025%, or Fusobacterium >0.001% showed approximately 70-80% probability of having metastases at the time of enrollment.Worth of note was also the case of patients with Papillibacter levels >0.001% and Fusobacterium levels between 0.005% and 0.03%, which have more than 60% probability of having metastases.A further interaction involving Fusobacterium was the one with Anaerostipes: Fusobacterium abundance higher than 0.001% together with Anaerostipes levels greater than 0.04% showed an approximately 55-60% probability of having metastases.At the species level, Anaerostipes hadrus, Coprobacter secundus, Clostridium sp.619, and Roseburia inulinivorans were the most important, followed by Bacteroides barnesiae, Bifidobacterium breve, Bacillus nealsonii, Paraprevotella clara, Roseburia intestinalis, Streptococcus anginosus, Oscillibacter sp.Marseille-P3260, Turicibacter sanguinis, Dialister sp.Oral taxon 502 and seven other species with VIMP < 10% (Figure 6a).The ROC curve shown in Figure 6b shows a good quality level of iRF discriminatory ability: AUC = 0.804 (bootstrap 95% CI: 0.676-0.916).The relative abundance of Coprobacter secundus (p = .002),Clostridium sp.619 (p = .007),Bacillus nealsonii (p = .017),and Turicibacter sanguinis (p = .014)showed a significant increase in the PC met group compared to the PC group, whereas Roseburia intestinalis (p = .022)and Streptococcus anginosus (p = .033)were increased in PC patients compared to PC met patients (Figure 6c).In the ALE plots, Anaerostipes hadrus, Coprobacter secundus, Clostridium sp.619, and Bifidobacterium breve showed a sigmoid curve (Figure 6d).As for Anaerostipes hadrus, Coprobacter secundus and Clostridium sp.619 metastases were less probable at lower levels and then became more probable as the relative abundance increased.In contrast, for Bifidobacterium breve, the probability of metastases was higher at showing the sensitivity and specificity achieved at each possible cut-off of the predicted probabilities computed by the iRF from the out-of-bag data, and hence is not prone to overfitting.The area under the ROC curve (AUC) was corroborated by its 95% confidence interval, computed after 1000 stratified bootstrap replicates.The AUC quantifies the discriminatory power achieved by the iRF, which reflect the one achieved by the all variables reported in the VIMP plot.(c) Boxplots of the relative abundance (%) of all variables detected by the iRF with VIMP > 10%.(d) Accumulated Local Effect (ALE) of all variables detected by the iRF with VIMP > 10%.ALE describes how the variables influence the predicted probability on average.The gray band is a confidence band for the regression line fitted in the estimated ALE points.(e) Network plot of the most stable two-order interactions (i.e. with a stability score ≥0.5).The stability of a recovered interaction is defined as the proportion of times that interaction appears as an output of the generalized Random Intersection Trees, after a bootstrap perturbation of the data (i.e.0 = totally instable interaction,1=totally stable interaction).The higher the stability score, the better is the quality of the recovered interaction.(f) Partial Dependence Plot (PDP) produced for variables with top stable interactions (i.e.stability score >0.70).PDPs show the marginal (total) effect that two variables have on the predicted outcome.Colored zones locate those regions at which the metastatic event more likely occurs (green/yellow) and not occurs (blue/violet).Individual observations are plotted with respect to each variable combination.Relative abundances (%) values reported in c, d and f panels are in logit scale.
lower abundance levels and then decreased when abundance increased.At this taxonomic level, it was interesting to note that the course of Roseburia inulinivorans represented by a U-shaped curve (Figure 6d).The most stable interaction pathways among the detected species are shown in Figure 6e.At this taxonomic level, PDPs (Figure 6f) suggested that patients with Anaerostipes hadrus abundance greater than 0.032% together with Clostridium sp.619 > 0.002% or Roseburia intestinalis <0.008%, and patients with Clostridium sp.619 > 0.002% together with Bifidobacterium breve <0.001% showed approximately 60-65% probability of having metastases at the time of enrollment.
At all taxonomic levels, it should be noted that the ALE plots were performed on all variables detected by the iRF with VIMP > 10%, whereas the PDPs were produced only for those variables with stable interactions (i.e., with stability score >0.70).showing the sensitivity and specificity achieved at each possible cut-off of the predicted probabilities computed by the iRF from the out-of-bag data, and hence is not prone to overfitting.The Area under the ROC curve (AUC) was corroborated by its 95% confidence interval, computed after 1000 stratified bootstrap replicates.The AUC quantifies the discriminatory power achieved by the iRF, which reflect the one achieved by the all variables reported in the VIMP plot.(c) Boxplots of the relative abundance (%) of all variables detected by the iRF with VIMP > 10%.(d) Accumulated Local Effect (ALE) of all variables detected by the iRF with VIMP > 10%.ALE describes how the variables influence the predicted probability on average.The gray band is a confidence band for the regression line fitted in the estimated ALE points.(e) Network plot of the most stable two-order interactions (i.e. with a stability score ≥0.5).The stability of a recovered interaction is defined as the proportion of times that interaction appears as an output of the generalized Random Intersection Trees, after a bootstrap perturbation of the data (i.e.0 = totally instable interaction,1=totally stable interaction).The higher the stability score, the better is the quality of the recovered interaction.(f) Partial Dependence Plot (PDP) produced for variables with top stable interactions (i.e.stability score >0.70).PDPs show the marginal (total) effect that two variables have on the predicted outcome.Colored zones locate those regions at which the metastatic event more likely occurs (green/yellow) and not occurs (blue/violet).Individual observations are plotted with respect to each variable combination.Relative abundances (%) values reported in c, d and f panels are in logit scale.

Functional prediction of gut microbial profiles
In order to get a more comprehensive view of the role of gut microbiota in PC metastatization, we performed a prediction of the functional/metabolic capabilities of the microbial communities identified by 16S rRNA sequencing in the two experimental groups, by using the software package Tax4Fun2.As shown in Figure 7, a prediction of bacterial functions at level 1 (A) and 2 (B) of KEGG pathways was performed.Among the main differences, the biosynthesis and metabolism of amino acids (above all aromatic and branched chain amino acids) and the transcriptions function were increased in PC patients, whereas signal transduction was increased in PC met.

Fecal metabolomics analysis
In addition to the functional profile, a metabolomics analysis on fecal samples of a subset of patients from each of the two groups was carried ).The stability of a recovered interaction is defined as the proportion of times that interaction appears as an output of the generalized Random Intersection Trees, after a bootstrap perturbation of the data (i.e.0 = totally instable interaction,1=totally stable interaction).The higher the stability score, the better is the quality of the recovered interaction.(f) Partial Dependence Plot (PDP) produced for variables with top stable interactions (i.e.stability score >0.70).PDPs show the marginal (total) effect that two variables have on the predicted outcome.Colored zones locate those regions at which the metastatic event more likely occurs (green/yellow) and not occurs (blue/violet).Individual observations are plotted with respect to each variable combination.Relative abundances (%) values reported in c, d and f panels are in logit scale.
out. Figure 8 shows the heatmaps representing the 6 out of 106 polar metabolites (A), the 29 out 545 lipids identified in positive ionization mode (B) and the 14 out of 317 lipids identified in negative ionization mode (C) that were significantly different among the two groups, respectively.As for polar compounds (Figure 8a), glutamic acid was found to be significantly enriched in PC met patients, whereas 4-pyridoxic acid, N-acetylhistidine, tyrosine, cytosine and xanthine were found increased in non-metastatic PC patients.A greater impact was observed on lipid metabolism, in which the most relevant classes found in positive polarity were diacylglycerols (DG) and N-acyl glycines (NAGly).Some DG were found enriched in metastatic patients (DG 40:  (F 18:4), NAGly 36:2;O2| NAGly 18:0;O(FA 18:1)).However, the most significant impact on intestinal lipid composition was observed in negative polarity, where a class of oxidized fatty acids delineated a clear separation between the metastatic and non-metastatic groups, being enriched in the latter.This class includes FA 18:2, FA 18:3 and FA 20:5, known, respectively, as linolic acid, linolenic acid and 5,8,11,14,17-eicosapentaenoic acid (EPA) (Figure 8b).

Correlations between gut microbiota and the fecal metabolome
Both the polar and the lipidic metabolites which were differently represented in PC and PC met were then correlated with the bacteria which best discriminate the two microbial profiles (Figure 9a, b, respectively).Regarding the polar metabolites, 4pyridoxic acid showed a significant negative correlation with Clostridiaceae (r= − 0.56), Peptococcaceae (r= − 0.46), Bacillus nealsonii (r= − 0.52), Oscillibacter sp.Marseille-P3260 (r= − 0.62) and a significant correlation with Bifidobacterium breve (r = 0.67); cytosine was negatively associated with Peptococcaceae (r= − 0.49) and Turicibacter sanguinis (r = − 0.55) and positively associated with Bifidobacterium breve (r = 0.55); tyrosine was negatively correlated with and positively correlated with Turicibacter sanguinis (r= − 0.61) and positively associated with Streptococcus anginosus (r = 0.45).Concerning lipids, a considerable number of both positive and negative correlations were observed.

Discussion
In the current study, a cohort of 53 patients with PC was recruited; 28 of these patients were already metastatic at the time of diagnosis.As expected, the PC met group showed a more advanced disease (the vast majority of patients had stage IV cancer) and a higher mortality rate than non-metastatic PC patients.Interestingly, the two groups differed significantly in terms of jaundice and endoprosthesis use, which were both more frequent in non-metastatic patients.Liver injury caused by endoprosthesis positioning could explain why hepatic biomarkers such as bilirubin, AST, GGT and ALP were remarkably higher than those in metastatic patients.
Metastases in PC represent a huge burden because they prevent patients from accessing potentially curative surgery; therefore, deeply delving into the mechanisms behind this process remains an urgent need.Recent studies suggest that the microbiota may be involved in both cancer progression and metastasis, through its interaction with the host immune system and the production of bacterial metabolites and molecules. 14,18In particular, LPS constituting the cell wall of Gram-negative bacteria has been described to induce epithelial-mesenchymal transition 28,29 and promote angiogenesis through its interaction with TLR-4 on the host cell membranes and VEGF expression. 19,30To our knowledge, this is the first report investigating gut microbiota composition in patients with metastatic and non-metastatic PC.First, the PELORA approach was used to identify patterns of bacteria at different taxonomic levels, identified by 16S sequencing, that best discriminate the two groups.At first glance, an overall increase in Gram-negative bacteria was observed in metastatic PC patients compared to that in non-metastatic ones.Indeed, the taxa whose abundance increased the most in metastatic subjects, such as Tenericutes, Anaeroplasmataceae, Phascolarctobacterium succinatutens, Pseudomonas aeruginosa, Haemophilus parainfluenzae, Fournierella massiliensis, Parabacteroides sp.SN4, were all Gram-negative bacteria, leading us to speculate that an increase in the LPS-TLR4-VEGF pathway could have occurred, promoting angiogenesis and subsequent metastases.To further dissect the potential use of gut microbes and their abundance and to improve their performance in the identification of PC patients at risk of metastases, a machine learning algorithm was also implemented.This approach identified bacterial signatures and reciprocal interactions with good discriminatory power between patients with metastatic and non-metastatic PC.2][33][34] Surprisingly, according to our analysis, the risk of metastasis decreased as abundance of Porphyromonas increased.On the contrary, lower abundances of Fusobacterium slightly decreased the probability of metastatic events, in good agreement with the literature, according to which F. nucleatum promotes metastases in colorectal 35 and esophageal 36 cancers and elicits migration in pancreatic cancer cell lines. 34At the species level, the machine learning algorithm detected Anaerostipes hadrus, Roseburia inulinivorans and Roseburia intestinalis, well-known producers of butyrate, 37,38 which has been reported to inhibit pancreatic cancer cell invasion in vitro. 39Consistently, in our analysis, the abundance of R. inulinivorans and R. intestinalis was inversely correlated with the risk of metastases, whereas A. hadrus was not.However, previous studies, have shown that A. hadrus is overabundant in children with type 1 diabetes mellitus 40 and worsens colitis in mice, 37 suggesting that, although potentially beneficial in healthy conditions, this butyrate producer may be harmful in diseased subjects. 37Interestingly, Bifidobacterium breve higher abundance showed a protective effect against metastases.Although, to the best of our knowledge, no antimetastatic action has been documented yet for this bacterium, B. breve is generally regarded as a beneficial probiotic.It showed anti-tumor properties in mice with oral carcinoma, 41 and it was associated with increased progression-free survival in lung cancer patients. 42Finally, among the main species with discriminatory power between the two groups we also found Streptococcus anginosus, whose abundance was reduced in metastatic compared to non-metastatic patients.This result is in line with a previous in vitro study in which S. anginosus supernatant inhibited the proliferation, migration and invasion ability of oral squamous cell carcinoma. 43AI, in which machine learning represents a field, is already widely applied in different areas of pancreatic cancer, from diagnosis to prognosis, including the prediction of survival, recurrence, metastases, and response to treatment. 44In our cohort of patients, with this approach, it was possible to understand that microbial variations within the intestinal microbiota may contribute to development of metastases in PC patients.Further research could expand the potential applications of this intervention strategy and guide clinicians in discriminating between patients with metastatic and non-metastatic PC.Specifically, by considering not only a single microorganism but also bacterial patterns with already known interactions, important information could be obtained for the development of new therapeutic and/or integrative strategies in PC treatment, as well as indications for improving the diagnostic process.In this regard, it will also be appropriate to pay attention to the relative abundance of microorganisms, since even minimal variations could have completely different effects from those expected.This can pave the way for new diagnostic devices based on new algorithms to support clinicians in their diagnosis.
When metabolomics analysis was performed on fecal samples from a subset of PC and PC met patients, a number of statistically significant changes emerged.Glutamic acid was found significantly enriched in the fecal samples from metastatic patients.Interestingly, glutamate has been implicated in increasing pancreatic cancer cells migration and invasion through binding to specific receptors and activating Kras-MAPK signaling. 45Tyrosine increase in the feces from PC patients without metastasis was consistent with the functional prediction performed on gut microbiota, in which phenylalanine, tyrosine and tryptophan biosynthesis resulted increased as well.Moreover, higher fecal levels of xanthine in the same group were again in agreement with the functional profile, which predicted an increase in purine metabolism in PC patients as compared to PC met.Lipidomics showed a number of statistically significant lipids, capable of highlighting changes between the PC group and the PC met group.Among lipids obtained in positive polarity, a class of DGs was found enriched in patients with metastases.Indeed, DG appears to be involved in several cellular mechanisms such as motility, survival and cell proliferation and an imbalance in its homeostasis or in the functioning of its effectors, especially protein kinase C (PKC), seems to be involved in the progression and development of metastasis. 46In a study on triple-negative breast cancer, high expression of a diacylglycerol kinase ζ (DGKZ) was observed to promote metastasis in vitro and in vivo. 47On the contrary, NAGly were found statistically enriched in patients without metastases.Interestingly, reduced expression of the glycine N-acyltransferase (GLYAT) gene was correlated with increased cell proliferation and increased migratory properties of tumor cells, perhaps due to the activation of PI3K/AKT/Snail signaling, which induces epithelial-mesenchymal transition (EMT). 48he lipidomics analysis conducted in negative polarity highlights a notable increase in the fatty acid class in patients without metastases in agreement with the study by Luo et al. 49 In detail, the level of 5,8,11,14,17-eicosapentaenoic acid (EPA), increased in PC group, was found statistically reduced in the tumor tissue of metastatic patients with colorectal cancer. 50Although the fecal metabolome is the result of both host and gut microbiota metabolism, a remarkable number of significant correlations between metabolites and gut bacteria were recorded.Strikingly, B. breve was the species most associated with the metabolomic differences observed.For instance, it positively correlated with 4-pyridoxic acid, a vitamin B6 metabolite, whose imbalance has been involved in tumor progression. 51Moreover B. breve also showed positive correlations with lipids mostly belonging to the fatty acids class.Some research has highlighted the antineoplastic role of fatty acids due to apoptotic induction in tumor cells or to the reduction of resistance to chemotherapy treatments. 52,53Furthermore, in a study conducted on patients with non-small-cell lung cancer and healthy individuals, B. breve was included among the intestinal bacteria capable of promoting progression-free survival in patients undergoing immunocombined chemotherapy. 42To date, further studies are required to investigate any interactions between intestinal bacteria and the metabolomic and lipidomic features that could characterize a pathophysiological condition.Therefore, since the gut microbiota is gaining an emerging role in the pathogenesis and development of cancer, we believe that investigating the microbiome to its full potential may open the way to new supportive approaches for the management of this devastating disease.

Figure 1 .
Figure 1.Gut microbiota composition at different taxonomic levels grouped by patients with or without metastases.Gut microbiota composition (i.e.mean relative abundance %) at phylum (a), family (b), genus (c), and species (d) levels in patients with pancreas tumor without (PC) and with metastases (PC met) at the enrollment.The "Others" category includes all bacteria whose mean relative abundance is less than 1% (at phylum and family level), 0.5% (at genus level) and 0.1% (at species level), respectively.As expected, the percentage of bacteria included in the "Others" category tends to increase as more specific taxa levels are considered.

Figure 2 .
Figure 2. Box plots of the Z-scores computed by PELORA at different taxonomic levels.Box plots of the Z-scores computed within each cluster (i.e.centroid) detected by PEnalized LOgistic Regression Analysis at phylum (a), family (b), genus (c), and species (D1) taxonomic hierarchy.As two different clusters of bacteria population were detected at species level, a scatter plot of the Z-scores computed within the two centroids was also shown (D2).Points depicted into the scatterplot represent the Z-scores pair, computed at each individual patient, and were filled with blue and red colors to denote PC and PC met patients (groups), respectively.Moreover, a polygon connecting the outermost data points is shown for each group to aid the visualization.

Figure 3 .
Figure 3. Heatmaps of relative abundance of bacterial populations identified by the PELORA within each cluster.Heatmaps of relative abundance (%) of bacterial populations identified by the PEnalized LOgistic Regression Analysis within each cluster at different taxonomic hierarchy.The order of the rows is determined by the main heatmap, which has been defined at the phylum level.All other heatmaps are automatically adjusted according to the settings in the main heatmap.

Figure 4 .
Figure 4. Results from iterative Random Forest of the most important bacteria detected at family level.(a) Variable importance plot is rescaled from 0% to 100% (relative VIMP) with respect to the maximum achieved value.Only variables with VIMP > 0 are shown and ranked from the most to the less important.(b) ROC curveshowing the sensitivity and specificity achieved at each possible cut-off of the predicted probabilities computed by the iRF from the out-of-bag data, and hence is not prone to overfitting.The area under the ROC curve (AUC) was corroborated by its 95% confidence interval, computed after 1000 stratified bootstrap replicates.The AUC quantifies the discriminatory power achieved by the iRF, which reflect the one achieved by the all variables reported in the VIMP plot.(c) Boxplots of the relative abundance (%) of all variables detected by the iRF with VIMP > 10%.(d) Accumulated Local Effect (ALE) of all variables detected by the iRF with VIMP > 10%.ALE describes how the variables influence the predicted probability on average.The gray band is a confidence band for the regression line fitted in the estimated ALE points.(e) Network plot of the most stable two-order interactions (i.e. with a stability score ≥0.5).The stability of a recovered interaction is defined as the proportion of times that interaction appears as an output of the generalized Random Intersection Trees, after a bootstrap perturbation of the data (i.e.0 = totally instable interaction,1=totally stable interaction).The higher the stability score, the better is the quality of the recovered interaction.(f) Partial Dependence Plot (PDP) produced for variables with top stable interactions (i.e.stability score >0.70).PDPs show the marginal (total) effect that two variables have on the predicted outcome.Colored zones locate those regions at which the metastatic event more likely occurs (green/yellow) and not occurs (blue/violet).Individual observations are plotted with respect to each variable combination.Relative abundances (%) values reported in c, d and f panels are in logit scale.

Figure 5 .
Figure 5. Results from iterative Random Forest of the most important bacteria detected at genus level.(a) Variable importance plot is rescaled from 0% to 100% (relative VIMP) with respect to the maximum achieved value.Only variables with VIMP > 0 are shown and ranked from the most to the less important.(b) ROC curveshowing the sensitivity and specificity achieved at each possible cut-off of the predicted probabilities computed by the iRF from the out-of-bag data, and hence is not prone to overfitting.The Area under the ROC curve (AUC) was corroborated by its 95% confidence interval, computed after 1000 stratified bootstrap replicates.The AUC quantifies the discriminatory power achieved by the iRF, which reflect the one achieved by the all variables reported in the VIMP plot.(c) Boxplots of the relative abundance (%) of all variables detected by the iRF with VIMP > 10%.(d) Accumulated Local Effect (ALE) of all variables detected by the iRF with VIMP > 10%.ALE describes how the variables influence the predicted probability on average.The gray band is a confidence band for the regression line fitted in the estimated ALE points.(e) Network plot of the most stable two-order interactions (i.e. with a stability score ≥0.5).The stability of a recovered interaction is defined as the proportion of times that interaction appears as an output of the generalized Random Intersection Trees, after a bootstrap perturbation of the data (i.e.0 = totally instable interaction,1=totally stable interaction).The higher the stability score, the better is the quality of the recovered interaction.(f) Partial Dependence Plot (PDP) produced for variables with top stable interactions (i.e.stability score >0.70).PDPs show the marginal (total) effect that two variables have on the predicted outcome.Colored zones locate those regions at which the metastatic event more likely occurs (green/yellow) and not occurs (blue/violet).Individual observations are plotted with respect to each variable combination.Relative abundances (%) values reported in c, d and f panels are in logit scale.

Figure 6 .
Figure 6.Results from iterative random forest of the most important bacteria detected at species level.(a) Variable importance plot is rescaled from 0% to 100% (relative VIMP) with respect to the maximum achieved value.Only variables with VIMP > 0 are shown and ranked from the most to the less important.(b) ROC curve showing the sensitivity and specificity achieved at each possible cut-offof the predicted probabilities computed by the iRF from the out-of-bag data, and hence is not prone to overfitting.The Area under the ROC curve (AUC) was corroborated by its 95% confidence interval, computed after 1000 stratified bootstrap replicates.The AUC quantifies the discriminatory power achieved by the iRF, which reflect the one achieved by the all variables reported in the VIMP plot.(c) Boxplots of the relative abundance (%) of all variables detected by the iRF with VIMP > 10%.(d) Accumulated Local Effect (ALE) of all variables detected by the iRF with VIMP > 10%.ALE describes how the variables influence the predicted probability on average.The gray band is a confidence band for the regression line fitted in the estimated ALE points.(e) Network plot of the most stable two-order interactions (i.e. with a stability score ≥0.5).The stability of a recovered interaction is defined as the proportion of times that interaction appears as an output of the generalized Random Intersection Trees, after a bootstrap perturbation of the data (i.e.0 = totally instable interaction,1=totally stable interaction).The higher the stability score, the better is the quality of the recovered interaction.(f) Partial Dependence Plot (PDP) produced for variables with top stable interactions (i.e.stability score >0.70).PDPs show the marginal (total) effect that two variables have on the predicted outcome.Colored zones locate those regions at which the metastatic event more likely occurs (green/yellow) and not occurs (blue/violet).Individual observations are plotted with respect to each variable combination.Relative abundances (%) values reported in c, d and f panels are in logit scale.

Figure 7 .
Figure 7. Functional prediction for gut microbial populations in PC and PC met patients.Significant KEGG pathways for gut microbiota between PC and PC met patients at level 1 (a) and 2 (b), respectively.Pathways were considered significant when the p-value from Mann-Whitney test was <0.05.

Figure 8 .
Figure 8. Fecal metabolic profile in PC and PC met patients.Heatmaps showing the significantly different (t-test, p < .05)metabolites between PC (blue squares) and PC met (red squares) patients.Polar metabolites (a) and lipidic metabolites obtained in positive polarity (b) and negative polarity (c) are shown, respectively.Shades of colors from blue to pink represent the abundance of each compound.

Figure 9 .
Figure 9. Associations between gut microbiota and the fecal metabolome.Spearman's rank correlations of the quantities of the significantly different polar metabolites (a) and lipids (b) from fecal samples with the intestinal bacteria discriminating the two groups of patients.Shades of colors from blue (negative correlation) to pink (positive correlation) indicate the value of the R correlation coefficient.

Table 1 .
Characteristics of patients with pancreatic tumor with and without metastases at the enrollment.

Table 2 .
Results from PEnalized LOgistic Regression Analysis (PELORA) in patients with PC without and with metastases.
Abbreviations: PC: patients with pancreatic tumor without metastasis at enrollment; PC met: patients with pancreatic tumor with metastasis at enrollment; IQR:Interquartile range (i.e.first-third quartiles); SD: Standard Deviation; Absent: all values are 0%.Standardized Z-score: